Amendments to the Claims:

This listing of claims replaces all prior versions and listings of claims in the 
application: 

Listing of Claims:

Claims 1-54 (Canceled). 

55. A computer-implemented method for identifying compounds in text,
comprising: 

extracting a vocabulary of tokens from text; 
iterating from n > 2 down to n = 2 where n decreases by one each 
iteration and in each iteration performing the actions of: 

identifying a plurality of unique n-grams in the text, each n-gram 
being an occurrence in the text of n sequential tokens, each token being found in 
the vocabulary; 

dividing each n-gram into n-1 pairs of two adjacent segments,
where each segment consists of at least one token; 

for each n-gram, calculating a likelihood of collocation for each 
pair of segments of the n-gram and determining a score for the n-gram based on 
a lowest calculated likelihood of collocation; 

identifying a set of n-grams having scores above a threshold; and

adding the identified set of n-grams as compound tokens to the 
vocabulary and removing constituent tokens that occur in the added compound 
tokens from the vocabulary. 
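
A minimal sketch of the iteration recited in claim 55 may help in reading the steps. It rests on several assumptions that the claim does not state: the names `split_pairs`, `ngram_score`, `merge_compounds`, and `find_compounds` are hypothetical, the pair scorer is passed in as a parameter rather than fixed, compound tokens are represented as space-joined strings, and the token stream is rewritten after each pass so that an accepted compound counts as a single token in later passes.

```python
from typing import Callable, List, Sequence, Set, Tuple

Segment = Tuple[str, ...]
PairScorer = Callable[[Segment, Segment], float]


def split_pairs(ngram: Segment) -> List[Tuple[Segment, Segment]]:
    """Divide an n-gram into its n-1 pairs of adjacent, non-empty segments."""
    return [(ngram[:i], ngram[i:]) for i in range(1, len(ngram))]


def ngram_score(ngram: Segment, pair_score: PairScorer) -> float:
    """Score an n-gram by the lowest likelihood of collocation among its pairs."""
    return min(pair_score(left, right) for left, right in split_pairs(ngram))


def merge_compounds(stream: List[str], accepted: Set[Segment], n: int) -> List[str]:
    """Assumption: rewrite the stream so each accepted n-gram becomes one token."""
    out: List[str] = []
    i = 0
    while i < len(stream):
        window = tuple(stream[i:i + n])
        if window in accepted:
            out.append(" ".join(window))
            i += n
        else:
            out.append(stream[i])
            i += 1
    return out


def find_compounds(tokens: Sequence[str], max_n: int, threshold: float,
                   pair_score: PairScorer) -> Set[str]:
    """Iterate n from max_n (> 2) down to 2 and return the final vocabulary."""
    stream = list(tokens)
    vocabulary = set(stream)  # extract a vocabulary of tokens from the text
    for n in range(max_n, 1, -1):
        # Unique n-grams whose tokens are all found in the vocabulary.
        windows = {tuple(stream[i:i + n]) for i in range(len(stream) - n + 1)}
        candidates = {g for g in windows if all(t in vocabulary for t in g)}
        # Keep the n-grams whose score is above the threshold.
        accepted = {g for g in candidates if ngram_score(g, pair_score) > threshold}
        # Remove constituent tokens, then add the compounds to the vocabulary.
        for g in accepted:
            vocabulary.difference_update(g)
        for g in accepted:
            vocabulary.add(" ".join(g))
        stream = merge_compounds(stream, accepted, n)
    return vocabulary
```

Taking the minimum over the n-1 splits means an n-gram is accepted only if every way of dividing it into two adjacent segments looks like a collocation, which is one reading of "a score for the n-gram based on a lowest calculated likelihood of collocation".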



56. The method of claim 55 where calculating a likelihood of collocation for each
pair of segments of the n-gram comprises determining a likelihood ratio $\lambda$ for
each pair of segments that is computed in accordance with the formula:

$$\lambda = \frac{L(H_i)}{L(H_c)}$$

where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis,
$L(H_c)$ is a likelihood of observing $H_c$ under a collocation hypothesis, and H is a
pair of segments.

57. The method of claim 56 where the $L(H_c)$ is computed for each pair of segments,
$t_1$, $t_2$, in each n-gram in accordance with the formula:

$$\underset{L(H_c)}{\arg\max}\ \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

58. The method of claim 56 where, for each pair of segments, $t_1$, $t_2$, in each n-gram,
the independence hypothesis comprises $P(t_2 \mid t_1) = P(t_2 \mid \bar{t}_1)$ and the collocation
hypothesis comprises $P(t_2 \mid t_1) > P(t_2 \mid \bar{t}_1)$.
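
The ratio of claim 56 and the two hypotheses of claim 58 can be illustrated with a short sketch. It assumes a binomial model for the counts, as in the standard log-likelihood ratio test for collocations; the claims do not state this model, and the function names and count parameters (`c1`, `c2`, `c12`, `total`) are hypothetical. Because the collocation hypothesis is one-sided, the sketch returns zero when the observed conditional rate after the first segment is not higher than the rate elsewhere.

```python
import math


def _xlogy(k: float, x: float) -> float:
    """Return k * log(x), treating 0 * log(0) as 0."""
    return 0.0 if k == 0 else k * math.log(x)


def _log_binomial(k: float, n: float, x: float) -> float:
    """Log of x^k * (1 - x)^(n - k); the binomial coefficient cancels in the ratio."""
    return _xlogy(k, x) + _xlogy(n - k, 1.0 - x)


def collocation_score(c1: int, c2: int, c12: int, total: int) -> float:
    """Return -2 * log(lambda) for a pair of segments (t1, t2).

    c1: occurrences of t1, c2: occurrences of t2,
    c12: occurrences of t1 immediately followed by t2,
    total: number of positions counted. Counts are assumed consistent.
    """
    if not (0 < c1 < total and 0 < c2 < total and 0 <= c12 <= min(c1, c2)):
        raise ValueError("inconsistent counts")
    p = c2 / total                 # shared rate under the independence hypothesis
    p1 = c12 / c1                  # rate of t2 after t1
    p2 = (c2 - c12) / (total - c1) # rate of t2 when not preceded by t1
    if p1 <= p2:
        return 0.0                 # no evidence for the one-sided collocation hypothesis
    log_lambda = (
        _log_binomial(c12, c1, p) + _log_binomial(c2 - c12, total - c1, p)
        - _log_binomial(c12, c1, p1) - _log_binomial(c2 - c12, total - c1, p2)
    )
    return -2.0 * log_lambda
```

Under this assumed model a larger returned value means stronger evidence for the collocation hypothesis, so the value can serve as the per-pair likelihood of collocation that is compared against a threshold.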

59. The method of claim 55 where identifying a plurality of unique n-grams in the
text comprises skipping n-grams appearing in a list of known compounds.
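
The skipping step of claim 59 can be sketched as a filter applied while the unique n-grams are collected. The function name `unique_ngrams` and the representation of the known-compound list as a set of token tuples are assumptions made for illustration.

```python
from typing import Sequence, Set, Tuple


def unique_ngrams(tokens: Sequence[str], n: int,
                  known_compounds: Set[Tuple[str, ...]]) -> Set[Tuple[str, ...]]:
    """Collect the unique n-grams in the text, skipping known compounds."""
    found: Set[Tuple[str, ...]] = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram in known_compounds:
            continue  # already known to be a compound; no need to score it
        found.add(ngram)
    return found
```

For example, with the known compound ("new", "york"), the bigrams collected from "new york is in new york state" exclude ("new", "york") but still include ("york", "state").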

60. A computer program product, encoded on a computer-readable medium, 
operable to cause data processing apparatus to perform operations comprising: 

extracting a vocabulary of tokens from text; 

iterating from n > 2 down to n = 2 where n decreases by one each
iteration and in each iteration performing the actions of: 

identifying a plurality of unique n-grams in the text, each n-gram 
being an occurrence in the text of n sequential tokens, each token being found in 
the vocabulary; 

dividing each n-gram into n-1 pairs of two adjacent segments,
where each segment consists of at least one token;

for each n-gram, calculating a likelihood of collocation for each
pair of segments of the n-gram and determining a score for the n-gram based on
a lowest calculated likelihood of collocation;

identifying a set of n-grams having scores above a threshold; and

adding the identified set of n-grams as compound tokens to the
vocabulary and removing constituent tokens that occur in the added compound
tokens from the vocabulary.

61. The program product of claim 60 where calculating a likelihood of collocation
for each pair of segments of the n-gram comprises determining a likelihood
ratio $\lambda$ for each pair of segments that is computed in accordance with the
formula:

$$\lambda = \frac{L(H_i)}{L(H_c)}$$

where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis,
$L(H_c)$ is a likelihood of observing $H_c$ under a collocation hypothesis, and H is a
pair of segments.

62. The program product of claim 61 where the $L(H_c)$ is computed for each pair of
segments, $t_1$, $t_2$, in each n-gram in accordance with the formula:

$$\underset{L(H_c)}{\arg\max}\ \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

63. The program product of claim 61 where, for each pair of segments, $t_1$, $t_2$, in each
n-gram, the independence hypothesis comprises $P(t_2 \mid t_1) = P(t_2 \mid \bar{t}_1)$ and the
collocation hypothesis comprises $P(t_2 \mid t_1) > P(t_2 \mid \bar{t}_1)$.

64. The program product of claim 60 where identifying a plurality of unique
n-grams in the text comprises skipping n-grams appearing in a list of known
compounds.

65. A system comprising: 

a computer readable medium including a program product; and 

one or more processors configured to execute the program product and 
perform operations comprising: 

extracting a vocabulary of tokens from text; 

iterating from n > 2 down to n = 2 where n decreases by one each 
iteration and in each iteration performing the actions of: 

identifying a plurality of unique n-grams in the text, each n-gram
being an occurrence in the text of n sequential tokens, each token being found in 
the vocabulary; 

dividing each n-gram into n-1 pairs of two adjacent segments,
where each segment consists of at least one token; 

for each n-gram, calculating a likelihood of collocation for each 
pair of segments of the n-gram and determining a score for the n-gram based on 
a lowest calculated likelihood of collocation; 

identifying a set of n-grams having scores above a threshold; and

adding the identified set of n-grams as compound tokens to the
vocabulary and removing constituent tokens that occur in the added compound 
tokens from the vocabulary. 

66. The system of claim 65 where calculating a likelihood of collocation for each
pair of segments of the n-gram comprises determining a likelihood ratio $\lambda$ for
each pair of segments that is computed in accordance with the formula:

$$\lambda = \frac{L(H_i)}{L(H_c)}$$

where $L(H_i)$ is a likelihood of observing $H_i$ under an independence hypothesis,
$L(H_c)$ is a likelihood of observing $H_c$ under a collocation hypothesis, and H is a
pair of segments.

67. The system of claim 66 where the $L(H_c)$ is computed for each pair of segments,
$t_1$, $t_2$, in each n-gram in accordance with the formula:

$$\underset{L(H_c)}{\arg\max}\ \frac{L(t_1, t_2 \text{ form compound})}{L(n\text{-gram does not form compound})}$$

68. The system of claim 66 where, for each pair of segments, $t_1$, $t_2$, in each n-gram,
the independence hypothesis comprises $P(t_2 \mid t_1) = P(t_2 \mid \bar{t}_1)$ and the collocation
hypothesis comprises $P(t_2 \mid t_1) > P(t_2 \mid \bar{t}_1)$.

69. The system of claim 65 where identifying a plurality of unique n-grams in the
text comprises skipping n-grams appearing in a list of known compounds. 



